Information Divergence is more $\chi^2$-distributed than the $\chi^2$-statistics

Abstract: For testing goodness of fit it is very popular to use either the $\chi^2$-statistic or the $G^2$-statistic (information divergence). Asymptotically both are $\chi^2$-distributed, so an obvious question is which of the two statistics has a distribution that is closest to the $\chi^2$-distribution. Surprisingly, the distribution of information divergence is much better approximated by a $\chi^2$-distribution than that of the $\chi^2$-statistic. For random variables we introduce a new transformation that transforms several important distributions into new random variables that are almost Gaussian. For the binomial distributions and the Poisson distributions we formulate a general conjecture about how close their transforms are to the Gaussian, and this conjecture gives much more accurate estimates of the tail probabilities of these distributions than previously published results. The conjecture is proved for Poisson distributions.

I. Choice of statistic 

We consider the problem of testing goodness-of-fit in a discrete setting. Here we shall follow the classic approach to this problem as developed by Pearson, Neyman and Fisher. The question is whether a sample with observation counts $(X_1, X_2, \dots, X_k)$ has been generated by the distribution $Q = (q_1, q_2, \dots, q_k)$. We introduce the empirical distribution $\hat{P} = (\hat{p}_1, \hat{p}_2, \dots, \hat{p}_k)$ with $\hat{p}_j = X_j/n$, where $n$ denotes the sample size $n = X_1 + X_2 + \dots + X_k$. Often one uses one of the Csiszár [1] $f$-divergences

$$D_f(\hat{P}, Q) = \sum_{j=1}^{k} q_j f\left(\frac{\hat{p}_j}{q_j}\right). \qquad (1)$$



The null hypothesis is accepted if the test statistic $D_f(\hat{P}, Q)$ is small and rejected if $D_f(\hat{P}, Q)$ is large. Whether $D_f(\hat{P}, Q)$ is considered to be small or large depends on the significance level [2]. The most important cases are obtained for the convex functions $f(t) = n(t-1)^2$ and $f(t) = 2nt\ln t$, leading to the Pearson $\chi^2$-statistic

$$\chi^2 = \sum_{j=1}^{k} \frac{(X_j - nq_j)^2}{nq_j}$$

or the likelihood ratio statistic

$$G^2 = 2\sum_{j=1}^{k} X_j \ln\frac{X_j}{nq_j},$$

which is a scaled version of information divergence that we will denote $D$ without subscript. For simplicity we shall focus our attention on the case where there are only two bins.

Fig. 1. Quantile plot of a $\chi^2$-distribution against the distribution of the $G^2$-statistic for a symmetric binomial distribution with n = 51. The midpoint of each step is marked. The red line marks the identity.
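As a minimal sketch of how the two statistics arise from the $f$-divergence (1), the following Python fragment evaluates $D_f$ for the two convex functions and compares the results with the direct formulas. The function names and the two-bin sample `counts = [18, 33]` are hypothetical and chosen only for illustration.

```python
import math

def f_divergence(counts, q, f):
    """D_f(P_hat, Q) = sum_j q_j * f(p_hat_j / q_j) with p_hat_j = X_j / n."""
    n = sum(counts)
    return sum(qj * f(xj / (n * qj)) for xj, qj in zip(counts, q))

def pearson_chi2(counts, q):
    """Pearson statistic: sum_j (X_j - n q_j)^2 / (n q_j)."""
    n = sum(counts)
    return sum((xj - n * qj) ** 2 / (n * qj) for xj, qj in zip(counts, q))

def g2(counts, q):
    """Likelihood ratio statistic: 2 sum_j X_j ln(X_j / (n q_j)), with 0 ln 0 = 0."""
    n = sum(counts)
    return 2 * sum(xj * math.log(xj / (n * qj))
                   for xj, qj in zip(counts, q) if xj > 0)

# Hypothetical two-bin sample with n = 51 and a symmetric null hypothesis.
counts = [18, 33]
q = [0.5, 0.5]
n = sum(counts)

# f(t) = n (t - 1)^2 gives the Pearson statistic.
chi2_via_f = f_divergence(counts, q, lambda t: n * (t - 1) ** 2)
# f(t) = 2 n t ln t gives the likelihood ratio statistic.
g2_via_f = f_divergence(counts, q,
                        lambda t: 2 * n * t * math.log(t) if t > 0 else 0.0)

print(chi2_via_f, pearson_chi2(counts, q))
print(g2_via_f, g2(counts, q))
```

Both pairs of numbers should agree up to rounding, which is all the sketch is meant to show.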

Notation 1: Please note that we follow the notation from [3] by denoting the likelihood ratio statistic by $G^2$ rather than $G$ as done in many textbooks and articles. Our $G^2$ should not be confused with Getis-Ord's $G$ statistic [4].

One way of choosing between various statistics is by computing their asymptotic efficiency. In 1985 it was proved that the $G^2$-statistic is more efficient in the Bahadur sense than the $\chi^2$-statistic, and this result has been extended in a number of papers [5]-[9]. The asymptotic Bahadur efficiency of $G^2$ implies that a much smaller sample size is needed when using $G^2$ than when using $\chi^2$ if a fixed power should be achieved at a very small significance level for some alternative. Since this type of result only holds asymptotically for large sample sizes, it may be difficult to use for a specific finite sample size. Therefore we will turn our attention to another important property for the choice of statistic.

For the practical use of a statistic it is important to calculate or estimate the distribution of the statistic. This can be done by exact calculation, by approximations, or by simulations. Exact calculations may be both time consuming and difficult. Simulation often requires statistical insight and programming skills. Therefore most statistical tests use approximations to calculate the distribution of the statistic. For a fixed number of bins the distribution of the $\chi^2$-statistic becomes closer and closer to the $\chi^2$-distribution as the sample size tends to infinity. For a large sample size the empirical distribution will with high probability be close to the generating distribution, and the Csiszár $f$-divergence $D_f$ can be approximated by a scaled version of the $\chi^2$-statistic. Therefore the distribution of any $f$-divergence may be approximated by a scaled $\chi^2$-distribution, i.e. a $\Gamma$-distribution. From this argument one might get the impression that the distribution of the $\chi^2$-statistic is closer to the $\chi^2$-distribution than that of the other $f$-divergences. Figure 1 and Figure 2 show that this is far from the case. Figure 1 shows that the $G^2$-statistic is almost as $\chi^2$-distributed as it can be, taking into account that the likelihood ratio statistic has a discrete distribution. Each step is intersected very close to its midpoint. Figure 2 shows that the distribution of the $\chi^2$-statistic deviates systematically from the $\chi^2$-distribution for small significance levels. These two plots show that at least in some cases the distribution of the $G^2$-statistic is much closer to a $\chi^2$-distribution than that of the Pearson statistic. The next question is whether there are situations where the likelihood ratio statistic is not approximately $\chi^2$-distributed. For binomial distributions that are very skewed the intersection property of Figure 1 is not satisfied when the $G^2$-statistic is plotted against the $\chi^2$-distribution, so in the rest of this paper a different type of plot will be used.

Fig. 2. Quantile plot of a $\chi^2$-distribution against the distribution of the $\chi^2$-statistic for a symmetric binomial distribution with n = 51. The red line marks the identity.
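The approximation of an $f$-divergence by a scaled $\chi^2$-statistic can be sketched by a second-order Taylor expansion. The sketch assumes that $f(1) = 0$ and that $f$ is twice differentiable at 1; these assumptions are ours and cover the two functions $f(t) = n(t-1)^2$ and $f(t) = 2nt\ln t$ used here.

```latex
% Second-order Taylor expansion of f around t = 1, assuming f(1) = 0.
% The first-order term vanishes because sum_j (p_hat_j - q_j) = 0.
\begin{align*}
D_f(\hat{P}, Q)
  &= \sum_{j=1}^{k} q_j f\!\left(\frac{\hat{p}_j}{q_j}\right) \\
  &\approx \sum_{j=1}^{k} q_j \left[ f'(1)\left(\frac{\hat{p}_j}{q_j} - 1\right)
     + \frac{f''(1)}{2}\left(\frac{\hat{p}_j}{q_j} - 1\right)^{2} \right] \\
  &= \frac{f''(1)}{2} \sum_{j=1}^{k} \frac{(\hat{p}_j - q_j)^2}{q_j}
   = \frac{f''(1)}{2n}\,\chi^2 .
\end{align*}
```

For $f(t) = 2nt\ln t$ we have $f''(1) = 2n$, so to this order $G^2 \approx \chi^2$, and for $f(t) = n(t-1)^2$ the expansion is exact.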

The use of the $G^2$-statistic rather than the $\chi^2$-statistic has become more and more popular since this was recommended in the 1981 edition of the popular textbook of Sokal and Rohlf [10]. T. Dunning [11] has made a summary of what the typical recommendations are about whether one should use the $\chi^2$-statistic or the $G^2$-statistic. The short version is that the statistic is approximately $\chi^2$-distributed when each bin contains at least 5 observations or the calculated variance for each bin is at least 5, and if any bin contains more than twice the expected number of observations then the $G^2$-statistic is preferable to the $\chi^2$-statistic. Our result is that the $G^2$-statistic is approximately $\chi^2$-distributed with no restrictions on the number of expected observations in each bin.
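The comparison behind Figures 1 and 2 can be reproduced numerically in the two-bin case. The following sketch, which is not the code used for the figures, computes the exact tail probabilities of both statistics for a symmetric binomial distribution with n = 51 and compares them with the $\chi^2$-approximation with one degree of freedom. All function names are hypothetical, and the tolerance `tol` is our own device for treating numerically equal statistic values as ties.

```python
import math

def binom_pmf(n, p, x):
    """Point probability of x successes in n trials."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def pearson_chi2(n, p, x):
    """Pearson statistic for two bins with counts (x, n - x)."""
    return (x - n * p)**2 / (n * p * (1 - p))

def g2(n, p, x):
    """Likelihood ratio statistic for two bins with counts (x, n - x)."""
    def term(obs, expected):
        return 0.0 if obs == 0 else obs * math.log(obs / expected)
    return max(2 * (term(x, n * p) + term(n - x, n * (1 - p))), 0.0)

def chi2_sf(s):
    """P(chi-square with 1 degree of freedom >= s) = erfc(sqrt(s / 2))."""
    return math.erfc(math.sqrt(s / 2))

def exact_tails(stat, n, p, x, tol=1e-9):
    """Exact P(stat > s) and P(stat >= s) at the observed value s = stat(x)."""
    s = stat(n, p, x)
    strict = sum(binom_pmf(n, p, y) for y in range(n + 1)
                 if stat(n, p, y) > s + tol)
    weak = sum(binom_pmf(n, p, y) for y in range(n + 1)
               if stat(n, p, y) >= s - tol)
    return strict, weak

n, p = 51, 0.5
for x in (30, 35, 40, 45, 51):
    for name, stat in (("chi2", pearson_chi2), ("G2", g2)):
        strict, weak = exact_tails(stat, n, p, x)
        approx = chi2_sf(stat(n, p, x))
        print(f"x={x:2d} {name:4s} exact tail in [{strict:.3e}, {weak:.3e}] "
              f"chi2 approximation {approx:.3e}")
```

For the $G^2$-statistic the approximation should fall between the two exact tail probabilities, which is the step-intersection seen in Figure 1, while for the Pearson statistic the approximation overshoots the exact tail by orders of magnitude far out in the tail.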

Notation 2: In the rest of this paper we will let $\tau$ denote the circle constant $2\pi$ and let $\phi$ denote the standard Gaussian density

$$\phi(x) = \frac{\exp(-x^2/2)}{\tau^{1/2}}.$$

We let $\Phi$ denote the distribution function of the standard Gaussian

$$\Phi(t) = \int_{-\infty}^{t} \phi(z)\, dz.$$

II. The G-transform and its distribution 

Here we shall introduce a transformation that is useful for our understanding of the fine structure of the distribution of the likelihood ratio statistic. Consider a 1-dimensional exponential family $P_\beta$ where

$$\frac{dP_\beta}{dP_0}(x) = \frac{\exp(\beta \cdot x)}{Z(\beta)}$$

and $Z$ denotes the partition function given by

$$Z(\beta) = \int \exp(\beta \cdot x)\, dP_0 x.$$

Let $P^\mu$ denote the element in the exponential family with mean value $\mu$. Let $\mu_0$ denote the mean value of $P_0$.

Definition 3: Let $X$ be a random variable with distribution $P_0$. Then the G-transform $G(X)$ of $X$ is the random variable given by

$$G(x) = \begin{cases} -\left(2D(P^x \| P_0)\right)^{1/2}, & \text{for } x < \mu_0; \\ +\left(2D(P^x \| P_0)\right)^{1/2}, & \text{for } x \geq \mu_0. \end{cases}$$



With this definition we have

$$\phi(G(x)) = \tau^{-1/2}\exp\left(-D(P^x \| P_0)\right).$$
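For the two families studied in this paper the G-transform has a closed form, since $D(Po(x) \| Po(t)) = x\ln(x/t) - x + t$ and $D(Bin(n, x/n) \| Bin(n, p)) = x\ln\frac{x}{np} + (n-x)\ln\frac{n-x}{n(1-p)}$. The following is a minimal sketch of Definition 3 for these cases; the function names are hypothetical and the convention $0\ln 0 = 0$ is assumed.

```python
import math

def _xlog(a, b):
    """a * ln(a / b) with the convention 0 * ln 0 = 0."""
    return 0.0 if a == 0 else a * math.log(a / b)

def divergence_poisson(x, t):
    """D(Po(x) || Po(t)) for t > 0."""
    return _xlog(x, t) - x + t

def divergence_binomial(x, n, p):
    """D(Bin(n, x/n) || Bin(n, p)) for 0 < p < 1 and 0 <= x <= n."""
    return _xlog(x, n * p) + _xlog(n - x, n * (1 - p))

def g_transform(x, mean, divergence):
    """Signed square root of twice the divergence, as in Definition 3."""
    root = math.sqrt(max(2.0 * divergence, 0.0))
    return -root if x < mean else root

def g_poisson(x, t):
    return g_transform(x, t, divergence_poisson(x, t))

def g_binomial(x, n, p):
    return g_transform(x, n * p, divergence_binomial(x, n, p))

def phi(z):
    """Standard Gaussian density with tau = 2 * pi."""
    return math.exp(-z * z / 2) / math.sqrt(math.tau)

# G-transform of a binomial distribution with n = 30 and p = 0.3.
for x in (0, 5, 9, 12, 20, 30):
    print(x, g_binomial(x, 30, 0.3))
```

The value of `phi(g_binomial(x, n, p))` equals $\tau^{-1/2}\exp(-D)$ by construction, which is the identity stated above.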



Now we can make quantile plots between the normal distribution and the distribution of the G-transform of various random variables. In Figures 3-7 the G-transforms of some binomial and Poisson distributions are compared with the standard Gaussian via their quantile transform. In these plots the green line corresponds to the large deviation bound. These plots support the following conjecture:

Conjecture 4 (The intersection property): Let $M$ denote a binomial distributed or Poisson distributed random variable and let $G(M)$ denote the G-transform of $M$. The quantile transform between $G(M)$ and a standard Gaussian $Z$ is a step function and the identity function intersects each step, i.e.

$$P(M < m) < P(Z \leq G(m)) < P(M \leq m)$$

for all integers $m$.




Fig. 3. Symmetric binomial distribution with n = 30.

Fig. 4. Binomial distribution with success probability equal to 0.3 and n = 30.

Fig. 5. Binomial distribution with success probability equal to 0.1 and n = 30.

Fig. 6. Poisson distribution with mean equal to 20.



Another way of reformulating the intersection property is that in the stochastic ordering $M$ should be less than a random variable with point probabilities $\Phi(G(m)) - \Phi(G(m-1))$ and greater than a random variable with point probabilities $\Phi(G(m+1)) - \Phi(G(m))$, where $G(-1)$ is defined as $-\infty$ and $G(n+1)$ is defined to be $\infty$ for a binomial distribution with number parameter $n$. The conjecture is so well supported by numerical calculations that we would not hesitate to recommend it for estimation of tail probabilities for the binomial distributions in goodness of fit tests instead of using the usual $\chi^2$-approximation of the $\chi^2$-statistic.
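The intersection property is easy to check numerically. The following sketch tabulates $P(M < m)$, $\Phi(G(m))$ and $P(M \leq m)$ for a Poisson distribution, using only the closed form of the Poisson divergence. The function names are hypothetical, and the mean 20 matches Figure 6 only for convenience.

```python
import math

def std_normal_cdf(z):
    """Distribution function of the standard Gaussian."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def g_poisson(m, t):
    """G-transform of Po(t) at the integer m."""
    d = (0.0 if m == 0 else m * math.log(m / t)) - m + t
    root = math.sqrt(max(2.0 * d, 0.0))
    return -root if m < t else root

def poisson_pmf(m, t):
    """P(M = m) for M ~ Po(t), computed on the log scale."""
    return math.exp(m * math.log(t) - t - math.lgamma(m + 1))

def intersection_table(t, m_max):
    """Rows (m, P(M < m), Phi(G(m)), P(M <= m)) for m = 0, ..., m_max."""
    rows, below = [], 0.0
    for m in range(m_max + 1):
        at_or_below = below + poisson_pmf(m, t)
        rows.append((m, below, std_normal_cdf(g_poisson(m, t)), at_or_below))
        below = at_or_below
    return rows

for m, lower, middle, upper in intersection_table(20.0, 40):
    status = "ok" if lower < middle < upper else "violated"
    print(f"m={m:2d}  {lower:.6e} < {middle:.6e} < {upper:.6e}  {status}")
```

Far out in the upper tail the three numbers are all close to 1, so a check on a larger range is better done on the complementary probabilities.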

As we see, both skewed binomial distributions and Poisson distributions have different step sizes for positive and negative values. Although the quantile transform between $G(M)$ and a standard Gaussian has the intersection property, interference between the step sizes may have the effect that the quantile transform between the $G^2$-statistic and the $\chi^2$-distribution does not necessarily have the intersection property.

III. The link to waiting times 

Hitherto we have discussed inequalities for discrete distributions, but there is an interesting link to inequalities for continuous distributions associated with waiting times. Assume that $M$ is Poisson distributed with mean $t$ and $T$ is Gamma distributed with shape parameter $m$ and scale parameter 1, i.e. the distribution of the waiting time until $m$ observations from a Poisson process with intensity 1. Then

$$P(M \geq m) = P(T \leq t).$$
The Gamma distribution $\Gamma(m, \theta)$ has density

$$\frac{1}{\theta^m}\,\frac{1}{\Gamma(m)}\, x^{m-1}\exp\left(-\frac{x}{\theta}\right)$$

so the divergence can be calculated as

$$D\left(\Gamma(m, \theta_1)\,\|\,\Gamma(m, \theta_2)\right) = m\left(\frac{\theta_1}{\theta_2} - 1 - \ln\frac{\theta_1}{\theta_2}\right).$$

If $G_P$ is the G-transform for $Po(t)$ and $G_\Gamma$ is the G-transform for $\Gamma(m, 1)$ then $G_P(m) = -G_\Gamma(t)$. This shows that if the G-transforms of the Gamma distributions are close to a Gaussian then so are the G-transforms of the Poisson distributions. Figure 7 shows the G-transform of some Gamma distributions.
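Both sides of the waiting time link can be checked numerically. The sketch below compares the Poisson tail $P(M \geq m)$ with a Simpson-rule integral of the $\Gamma(m, 1)$ density, and compares $D(Po(m) \| Po(t))$ with $D(\Gamma(m, t/m) \| \Gamma(m, 1))$, whose equality is what gives $G_P(m) = -G_\Gamma(t)$. The function names, the integration rule and the values m = 5, t = 3.2 are our own choices for illustration.

```python
import math

def poisson_upper_tail(m, t):
    """P(M >= m) for M ~ Po(t), via the complement of a finite sum."""
    return 1.0 - sum(math.exp(k * math.log(t) - t - math.lgamma(k + 1))
                     for k in range(m))

def gamma_cdf(m, t, steps=2000):
    """P(T <= t) for T ~ Gamma(shape m, scale 1), composite Simpson rule.

    Assumes an integer shape m >= 1 and an even number of steps.
    """
    def density(x):
        return x ** (m - 1) * math.exp(-x) / math.gamma(m)
    h = t / steps
    total = density(0.0) + density(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3

def divergence_poisson(m, t):
    """D(Po(m) || Po(t)) for m > 0."""
    return m * math.log(m / t) - m + t

def divergence_gamma(m, t):
    """D(Gamma(m, t/m) || Gamma(m, 1)), i.e. m (u - 1 - ln u) with u = t/m."""
    u = t / m
    return m * (u - 1 - math.log(u))

m, t = 5, 3.2
print(poisson_upper_tail(m, t), gamma_cdf(m, t))
print(divergence_poisson(m, t), divergence_gamma(m, t))
```

The sign flip in $G_P(m) = -G_\Gamma(t)$ comes from the ordering alone: $m$ lies above the Poisson mean $t$ exactly when $t$ lies below the Gamma mean $m$.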

We see that the fit with a straight line of slope 1 is extremely 
good. The point (0,0) is not on the line reflecting the fact that 
the means and the medians of the Gamma distributions do 
not coincide. In the next section we shall see that the quantile 
transform between a Gaussian and the G-transform of Gamma 
distributions is always below the identity. 




Fig. 7. Quantile transform between a standard Gaussian and $\Gamma(1, 1)$ (black), $\Gamma(3, 1)$ (blue), $\Gamma(20, 1)$ (yellow), and another standard Gaussian (green). The red curves are the large deviation bounds.



IV. The increasing property 

In this section we shall formulate some conditions that are stronger than the intersection property. The proof of the following lemma is an easy exercise so we omit it.

Lemma 5: Let $f_1$ and $f_2$ be the densities of the random variables $X_1$ and $X_2$ with respect to some measure $\mu$ on the real numbers. If

$$\frac{f_2}{f_1}$$

is increasing then $X_1$ is less than $X_2$ in the usual stochastic ordering.

Theorem 6: The G-transform of a Gamma distributed ran- 
dom variable is less than a standard Gaussian in the stochastic 
ordering. 

Proof: Let $T$ be a $\Gamma(m, 1)$ distributed random variable. The distribution in the exponential family based on $\Gamma(m, 1)$ with mean $t$ is $\Gamma(m, t/m)$. The G-transform is

$$G(t) = \pm\left[2D\left(\Gamma(m, t/m)\,\|\,\Gamma(m, 1)\right)\right]^{1/2}$$

where $\pm$ means that we will use $+$ when $t$ is greater than the mean $m$ and use $-$ when $t$ is less than $m$. For the Gamma distribution we have

$$\frac{d\Gamma(m, t/m)}{d\Gamma(m, 1)}(x) = \left(\frac{m}{t}\right)^{m}\exp\left(x\left(1 - \frac{m}{t}\right)\right),$$

so that $D(t) = D(\Gamma(m, t/m) \| \Gamma(m, 1)) = t - m - m\ln(t/m)$.

Let $W = G(T)$ with density $f(w)$. We want to prove that $\phi(w)/f(w)$ is increasing. Now the density of $T$ at $t$ equals $f(G(t))\,G'(t)$, so

$$\frac{\phi(G(t))}{f(G(t))} = \frac{\tau^{-1/2}\exp(-D(t))\,G'(t)}{t^{m-1}\exp(-t)/\Gamma(m)} = \frac{\Gamma(m)\,e^{m}}{\tau^{1/2}\,m^{m}}\; t\,G'(t).$$

Hence we want to prove that $tG'(t)$ is increasing. We have

$$tG'(t) = \pm\frac{tD'(t)}{(2D(t))^{1/2}} = \pm m\,\frac{t/m - 1}{\left(2m\left(t/m - 1 - \ln(t/m)\right)\right)^{1/2}}.$$

With the substitution $u = t/m$ we have to prove that

$$\pm\frac{u - 1}{(2(u - 1 - \ln u))^{1/2}}$$

is increasing. We have

$$\frac{d}{du}\left(\frac{u - 1}{(2(u - 1 - \ln u))^{1/2}}\right) = \frac{u - 2\ln u - \frac{1}{u}}{(2(u - 1 - \ln u))^{3/2}}$$

so we want to prove that

$$\pm\left(u - 2\ln u - \frac{1}{u}\right) \geq 0.$$

Now we have to prove that $\ell(u) = u - 2\ln u - \frac{1}{u}$ is positive for $u > 1$ and negative for $u < 1$. Obviously $\ell(1) = 0$ so it is sufficient to prove that $\ell'(u) \geq 0$, but

$$\ell'(u) = 1 - \frac{2}{u} + \frac{1}{u^2} = \left(1 - \frac{1}{u}\right)^{2} \geq 0.$$

■

Next we shall formulate an even stronger conjecture and see that it actually implies that binomial distributions and Poisson distributions have the intersection property.
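The key step in the proof of Theorem 6 is the monotonicity of $tG'(t)$ for the Gamma distribution, which after the substitution $u = t/m$ reduces to the function $|u - 1| / (2(u - 1 - \ln u))^{1/2}$. As a sanity check, and not as a replacement for the analytic argument, the sketch below evaluates this function on a grid and compares $\ell'(u) = (1 - 1/u)^2$ with a finite difference. The function names and the grid are our own choices.

```python
import math

def scaled_tg_prime(u):
    """|u - 1| / sqrt(2 (u - 1 - ln u)), i.e. t G'(t) / sqrt(m) with u = t/m, u != 1."""
    return abs(u - 1) / math.sqrt(2 * (u - 1 - math.log(u)))

def ell(u):
    """Numerator of the derivative appearing in the proof."""
    return u - 2 * math.log(u) - 1 / u

def ell_prime(u):
    """Closed form of the derivative of ell: (1 - 1/u)^2 >= 0."""
    return (1 - 1 / u) ** 2

# Grid on [0.05, 5.00] that skips the removable singularity at u = 1.
grid = [i / 100 for i in range(5, 501) if i != 100]
values = [scaled_tg_prime(u) for u in grid]
is_increasing = all(a < b for a, b in zip(values, values[1:]))
print("increasing on the grid:", is_increasing)

# Compare the closed form of ell' with a central finite difference.
h = 1e-6
for u in (0.3, 0.8, 1.5, 4.0):
    numeric = (ell(u + h) - ell(u - h)) / (2 * h)
    print(u, numeric, ell_prime(u))
```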

Conjecture 7 (The increasing property): If $M$ is a binomially or Poisson distributed random variable with G-transform $G(M)$ then

$$\frac{P(M = m)}{\Phi(G(m+1)) - \Phi(G(m))}$$

is increasing and

$$\frac{P(M = m)}{\Phi(G(m)) - \Phi(G(m-1))} \qquad (5)$$

is decreasing.

The conjecture is supported by numerous numerical computations. If it holds, the intersection property follows by Lemma 5. The increasing property implies log-concavity of the distribution, but for instance the geometric distribution is log-concave but does not satisfy the intersection property. We have some indications that the conjecture also holds for any distribution of a sum of independent Bernoulli random variables.
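The two ratios in Conjecture 7 can be tabulated for a given binomial distribution as follows. This is only a sketch for numerical experiments: the function names are hypothetical, and Gaussian interval probabilities are computed from the nearer tail, a choice we make to avoid cancellation when both end points are far from zero. The code does not decide the conjecture; it only produces the sequences whose monotonicity the conjecture predicts.

```python
import math

def normal_interval(a, b):
    """P(a < Z <= b) for a standard Gaussian Z, computed from the nearer tail."""
    if a >= 0:
        return 0.5 * (math.erfc(a / math.sqrt(2)) - math.erfc(b / math.sqrt(2)))
    return 0.5 * (math.erfc(-b / math.sqrt(2)) - math.erfc(-a / math.sqrt(2)))

def g_binomial(x, n, p):
    """G-transform of Bin(n, p), with G(-1) = -inf and G(n + 1) = +inf."""
    if x < 0:
        return -math.inf
    if x > n:
        return math.inf
    div = sum(0.0 if a == 0 else a * math.log(a / b)
              for a, b in ((x, n * p), (n - x, n * (1 - p))))
    root = math.sqrt(max(2.0 * div, 0.0))
    return -root if x < n * p else root

def conjecture_ratios(n, p):
    """Return the two ratio sequences of Conjecture 7 for m = 0, ..., n.

    upper[m] = P(M = m) / (Phi(G(m + 1)) - Phi(G(m)))
    lower[m] = P(M = m) / (Phi(G(m)) - Phi(G(m - 1)))
    """
    upper, lower = [], []
    for m in range(n + 1):
        pmf = math.comb(n, m) * p**m * (1 - p)**(n - m)
        g_prev, g_here, g_next = (g_binomial(m + k, n, p) for k in (-1, 0, 1))
        upper.append(pmf / normal_interval(g_here, g_next))
        lower.append(pmf / normal_interval(g_prev, g_here))
    return upper, lower

upper, lower = conjecture_ratios(50, 0.5)
for m in range(0, 51, 5):
    print(m, math.log(upper[m]), math.log(lower[m]))
```

For a symmetric binomial distribution the two sequences are mirror images of each other, `upper[m] == lower[n - m]` up to rounding, which gives a simple consistency check on the implementation.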

Theorem 8: The intersection property is satisfied for any 
Poisson random variable. 

Proof: (Sketch) The inequality

$$P(M < m) < P(Z \leq G(m))$$

follows from Theorem 6 combined with Equation 4. For values of $m < 5$ the inequality

$$P(M \leq m) > P(Z \leq G(m))$$

can be proved case by case, but the complexity of the proof increases with $m$. The simplest case is $m = 0$ where the inequality translates into the following well-known tail bound for the Gaussian distribution

$$\Phi(z) < \exp\left(-\frac{z^2}{2}\right)$$

for $z < 0$.

Fig. 8. Plot of the logarithm of (4) for a symmetric binomial distribution with n = 50.

Following the ideas above it is sufficient to prove that

$$\frac{P(M = m)}{\int_{G(m-1)}^{G(m)} \phi(z)\, dz}$$

is decreasing or, equivalently, that

$$\frac{\int_{G(m-1)}^{G(m)} \phi(z)\, dz}{\int_{G(m)}^{G(m+1)} \phi(z)\, dz} \leq \frac{P(M = m)}{P(M = m+1)}.$$

By a substitution the left hand side can be written as

$$\frac{\int_{G(m-1)}^{G(m)} \phi(z)\, dz}{\int_{G(m)}^{G(m+1)} \phi(z)\, dz} = \frac{\int_0^1 \phi(G(m-1+s))\, G'(m-1+s)\, ds}{\int_0^1 \phi(G(m+s))\, G'(m+s)\, ds}$$

and the right hand side equals $(m+1)/t$. Therefore it is sufficient to prove that

$$\frac{\phi(G(m-1+s))}{\phi(G(m+s))} \cdot \frac{G'(m-1+s)}{G'(m+s)} \leq \frac{m+1}{t}.$$

The first factor can be rewritten as

$$\frac{\phi(G(m-1+s))}{\phi(G(m+s))} = \frac{m+s}{t}\exp\left((m-1+s)\ln\left(1 + \frac{1}{m-1+s}\right) - 1\right).$$

This factor is maximal for $s = 1$ where it equals $\frac{m+1}{t}\exp\left(m\ln\left(1 + \frac{1}{m}\right) - 1\right)$, so it is sufficient to prove that

$$\exp\left(m\ln\left(1 + \frac{1}{m}\right) - 1\right)\frac{G'(m-1+s)}{G'(m+s)} \leq 1.$$

The last factor is decreasing in $t$ and in $s$ so we may assume that $t = 1$ and $s = 0$. With these assumptions Inequality 7 is proved for $m \geq 5$ by standard analytic methods.

For t < 1 a slightly more accurate (but more extensive) 
bound on <JSJ can be used to bound it by for to > 1. ■ 

Theorem 8 gives bounds on the tail probabilities for Poisson distributions that are far better than what can be found in the literature (see for instance [12]). At the same time the theorem gives bounds on the median that are compatible with the bounds in the literature [13], [14].